
    Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

    Developing high-performance embedded vision applications requires balancing run-time performance with energy constraints. Given the mix of hardware accelerators that exist for embedded computer vision (e.g. multi-core CPUs, GPUs, and FPGAs) and their associated vendor-optimized vision libraries, it becomes a challenge for developers to navigate this fragmented solution space. To aid in determining which embedded platform is most suitable for a given application, we conduct a comprehensive benchmark of the run-time performance and energy efficiency of a wide range of vision kernels. We discuss why a given underlying hardware architecture innately performs well or poorly based on the characteristics of a range of vision kernel categories. Specifically, our study covers three commonly used hardware accelerators for embedded vision applications: the ARM Cortex-A57 CPU, the Jetson TX2 GPU, and the ZCU102 FPGA, using their vendor-optimized vision libraries: OpenCV, VisionWorks, and xfOpenCV. Our results show that the GPU achieves an energy/frame reduction ratio of 1.1–3.2× compared to the others for simple kernels, while for more complicated kernels and complete vision pipelines, the FPGA outperforms the others with energy/frame reduction ratios of 1.2–22.3×. We also observe that the FPGA performs increasingly better as a vision application's pipeline complexity grows.
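
    To make the energy/frame metric concrete, here is a minimal Python sketch of how such a figure can be derived from a timed kernel run and an average board power reading; run_kernel, frames, and avg_power_watts are hypothetical stand-ins, not the paper's measurement harness.

```python
import time

def energy_per_frame(run_kernel, frames, avg_power_watts):
    """Estimate energy/frame (joules) for a vision kernel.

    run_kernel and avg_power_watts are placeholders: in a real
    setup, power would come from on-board sensors (e.g. the power
    rails on a TX2 or ZCU102 board), not a fixed constant.
    """
    start = time.perf_counter()
    for frame in frames:
        run_kernel(frame)
    elapsed = time.perf_counter() - start
    return avg_power_watts * elapsed / len(frames)

# A 2x energy/frame reduction means platform A uses half the
# joules per processed frame compared to platform B:
# reduction_ratio = epf_platform_b / epf_platform_a
```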

    A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers

    Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases using a computational step called basecalling. The accuracy and speed of basecalling have critical implications for all later steps in genome analysis. Many researchers adopt complex deep learning-based models to perform basecalling without considering the compute demands of such models, which leads to slow, inefficient, and memory-hungry basecallers. Therefore, there is a need to reduce the computation and memory cost of basecalling while maintaining accuracy. Our goal is to develop a comprehensive framework for creating deep learning-based basecallers that provide high efficiency and performance. We introduce RUBICON, a framework to develop hardware-optimized basecallers. RUBICON consists of two novel machine-learning techniques that are specifically designed for basecalling. First, we introduce the first quantization-aware basecalling neural architecture search (QABAS) framework to specialize the basecalling neural network architecture for a given hardware acceleration platform while jointly exploring and finding the best bit-width precision for each neural network layer. Second, we develop SkipClip, the first technique to remove the skip connections present in modern basecallers to greatly reduce resource and storage requirements without any loss in basecalling accuracy. We demonstrate the benefits of RUBICON by developing RUBICALL, the first hardware-optimized basecaller that performs fast and accurate basecalling. Compared to the fastest state-of-the-art basecaller, RUBICALL provides a 3.96x speedup with 2.97% higher accuracy. We show that RUBICON helps researchers develop hardware-optimized basecallers that are superior to expert-designed models.
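
    As a loose illustration of the joint precision search that QABAS performs (not the actual framework, which couples the search to a hardware cost model and training), the following toy Python enumeration picks per-layer bit-widths that minimize model size subject to a made-up accuracy proxy.

```python
from itertools import product

# Hypothetical per-layer weight counts for a small basecaller-like
# network; QABAS itself searches jointly over architecture and
# precision with hardware feedback, which this toy enumeration
# does not capture.
layer_weights = [4096, 8192, 8192, 2048]
bit_choices = [4, 8, 16]

def model_bits(assignment):
    # Total weight storage if layer i uses assignment[i] bits.
    return sum(w * b for w, b in zip(layer_weights, assignment))

def accuracy_proxy(assignment):
    # Stand-in for a trained accuracy predictor: penalize low
    # precision more heavily in earlier layers.
    return 100 - sum((16 - b) * 0.1 * (len(assignment) - i)
                     for i, b in enumerate(assignment))

best = min(
    (a for a in product(bit_choices, repeat=len(layer_weights))
     if accuracy_proxy(a) >= 99.0),
    key=model_bits,
)
print(best, model_bits(best))
```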

    Tailor: Altering Skip Connections for Resource-Efficient Inference

    Deep neural networks use skip connections to improve training convergence. However, these skip connections are costly in hardware, requiring extra buffers and increasing on- and off-chip memory utilization and bandwidth requirements. In this paper, we show that skip connections can be optimized for hardware when tackled with a hardware-software codesign approach. We argue that while a network's skip connections are needed for the network to learn, they can later be removed or shortened to provide a more hardware-efficient implementation with minimal to no accuracy loss. We introduce Tailor, a codesign tool whose hardware-aware training algorithm gradually removes or shortens a fully trained network's skip connections to lower their hardware cost. Tailor improves resource utilization by up to 34% for BRAMs, 13% for FFs, and 16% for LUTs for on-chip, dataflow-style architectures. Tailor increases performance by 30% and reduces memory bandwidth by 45% for a 2D processing element array architecture.
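
    The core idea, gradually weakening a trained network's skip connections so they can eventually be dropped, can be sketched roughly as follows in PyTorch; the alpha-annealing schedule and block structure here are illustrative assumptions, not Tailor's actual algorithm.

```python
import torch
import torch.nn as nn

class AnnealedResidualBlock(nn.Module):
    """Residual block whose skip connection is scaled by alpha.

    A rough sketch of the removal idea: start from a fully
    trained network (alpha = 1), then anneal alpha toward 0
    over fine-tuning so the block learns to work without the
    skip path, which can then be deleted from the hardware
    implementation along with its buffers.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.alpha = 1.0  # annealed externally, not learned

    def forward(self, x):
        out = self.bn(self.conv(x))
        return torch.relu(out + self.alpha * x)

# During fine-tuning, e.g. linearly decay the skip contribution:
# for epoch in range(epochs):
#     block.alpha = max(0.0, 1.0 - epoch / anneal_epochs)
#     ... train one epoch ...
```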

    Microscaling Data Formats for Deep Learning

    Narrow bit-width data formats are key to reducing the computational and storage costs of modern deep learning applications. This paper evaluates Microscaling (MX) data formats that combine a per-block scaling factor with narrow floating-point and integer types for individual elements. MX formats balance the competing needs of hardware efficiency, model accuracy, and user friction. Empirical results on over two dozen benchmarks demonstrate the practicality of MX data formats as a drop-in replacement for baseline FP32 for AI inference and training with low user friction. We also show the first instance of training generative language models at sub-8-bit weights, activations, and gradients with minimal accuracy loss and no modifications to the training recipe.
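
    A toy sketch of the per-block scaling idea behind MX formats, assuming a shared power-of-two scale and symmetric INT8 elements (real MX formats use an 8-bit E8M0 scale over blocks of, typically, 32 elements, with FP8/FP6/FP4 or INT8 element types):

```python
import numpy as np

def mx_quantize(block, elem_bits=8):
    """Toy per-block scaled-integer quantizer in the spirit of MX.

    One power-of-two scale is shared by the whole block; each
    element is stored as a narrow symmetric integer.
    """
    qmax = 2 ** (elem_bits - 1) - 1
    amax = np.max(np.abs(block))
    # Shared power-of-two scale so amax maps inside [-qmax, qmax].
    exp = int(np.ceil(np.log2(amax / qmax))) if amax > 0 else 0
    scale = 2.0 ** exp
    q = np.clip(np.round(block / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def mx_dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(32).astype(np.float32)
q, s = mx_quantize(x)
print(np.max(np.abs(x - mx_dequantize(q, s))))  # quantization error
```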

    Performance and Complexity Co-evaluation of the Advanced Video Coding Standard for Cost-Effective Multimedia Communications

    The Advanced Video Coding (AVC) standard, recently defined by the Joint Video Team (JVT) of ITU-T and ISO/IEC, is introduced in this paper together with a co-evaluation of its performance and complexity. While the basic framework is similar to the motion-compensated hybrid scheme of previous video coding standards, additional tools improve the compression efficiency at the expense of an increased implementation cost. As a first step toward bridging the gap between the algorithmic design of a complex multimedia system and its cost-effective realization, a high-level co-evaluation approach is proposed and applied to a real-life AVC design. An exhaustive analysis of the codec's compression efficiency versus complexity (memory and computational costs) design space is carried out at the early algorithmic design phase. If all new coding features are used, the improved AVC compression efficiency (up to 50% compared to current video coding technology) comes with a complexity increase of a factor of 2 for the decoder and of more than one order of magnitude for the encoder. This represents a challenge for resource-constrained multimedia systems such as wireless devices or high-volume consumer electronics. The analysis also highlights an important property of the AVC framework that allows for complexity reduction at the system level: when the new coding features are combined, the implementation complexity accumulates, while the global compression efficiency saturates. Thus, a proper selection of the AVC tools maintains the same performance as the most complex configuration while considerably reducing complexity. The reported results provide inputs to assist the profile definition in the standard, highlight the AVC bottlenecks, and select optimal trade-offs between algorithmic performance and complexity.
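
    The accumulation-versus-saturation property can be illustrated with a toy model (hypothetical tools and numbers, not the paper's measurements): per-tool bit-rate savings combine multiplicatively and therefore saturate, while complexity costs simply add.

```python
# Illustrative only: made-up coding tools, each with a standalone
# bit-rate saving and a relative implementation cost.
tools = {            # (standalone saving, added complexity)
    "CABAC":        (0.12, 1.5),
    "quarter-pel":  (0.15, 2.0),
    "multi-ref":    (0.10, 3.0),
    "deblocking":   (0.08, 1.2),
}

def combined(selected):
    saving = 1.0
    complexity = 1.0
    for s, c in (tools[t] for t in selected):
        saving *= (1.0 - s)   # savings multiply, hence saturate
        complexity += c       # complexity simply accumulates
    return 1.0 - saving, complexity

print(combined(["CABAC", "quarter-pel"]))  # most of the gain...
print(combined(list(tools)))               # ...at far lower cost
```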

    Exploiting the Expressiveness of Cyclo-Static Dataflow to Model Multimedia Implementations

    The design of increasingly complex and concurrent multimedia systems requires a description at a higher abstraction level. Using an appropriate model of computation helps to reason about the system and enables design-time analysis methods. The nature of multimedia processing matches cyclo-static dataflow (CSDF) well in many cases, making it a suitable model. However, channels in an implementation often use, for cost reasons, a kind of shared buffer that cannot be directly described in CSDF. This paper shows how such implementation-specific aspects can be expressed in CSDF without the need for extensions. Consequently, the CSDF graph remains completely analyzable and allows reasoning about its temporal behavior. The obtained relation between model and implementation enables a buffer capacity analysis on the model while assuring the throughput of the final implementation. The capabilities of the approach are demonstrated by analyzing the temporal behavior of an MPEG-4 video encoder with a CSDF graph.
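
    A minimal sketch of the kind of reasoning a CSDF model enables: a simulation that bounds the occupancy of one channel between two actors. The cyclo-static rate sequences below are made up; the paper's point is that shared-buffer implementation details can be folded into such sequences without extending the model.

```python
# Two actors connected by one channel, each with a cyclically
# repeating production/consumption rate sequence (hypothetical).
prod_rates = [2, 1, 3]     # tokens produced per producer firing
cons_rates = [1, 2, 1, 2]  # tokens consumed per consumer firing

def max_occupancy(firings=100):
    tokens, peak = 0, 0
    p = c = 0
    for _ in range(firings):
        tokens += prod_rates[p % len(prod_rates)]
        p += 1
        peak = max(peak, tokens)
        # Fire the consumer as long as it has enough tokens (ASAP).
        while tokens >= cons_rates[c % len(cons_rates)]:
            tokens -= cons_rates[c % len(cons_rates)]
            c += 1
    return peak  # an upper bound on the channel capacity needed

print(max_occupancy())
```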

    Scaling Bandwidth and Complexity of Hybrid MC/DCT Video

    Multi-user video conferencing has high and diverse bandwidth and performance requirements. We analyse the scaling behaviour of a standard hybrid video codec by varying the quantization degree and the temporal and spatial resolution. We define a framework to select the set of coding parameters when the bandwidth and complexity constraints are known. Additionally, we compare the efficiency of simulcast and layered solutions in overcoming large bandwidth and/or computational resource variations among the participants. In this context, the temporal resolution turns out to be the parameter best suited to providing multiple resolutions through independent streams that cope with network unreliability.
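
    The parameter-selection step can be sketched as a small exhaustive search over operating points; the rate, complexity, and quality models below are toy stand-ins for the measured scaling behaviour, not the paper's framework.

```python
from itertools import product

# Hypothetical operating points: quantization step, temporal
# resolution (fps) and spatial resolution.
quant_steps = [8, 16, 32]
fps_opts = [10, 15, 30]
resolutions = [(176, 144), (352, 288)]

def select(bandwidth_kbps, complexity_budget):
    best, best_quality = None, -1.0
    for q, fps, (w, h) in product(quant_steps, fps_opts, resolutions):
        bitrate = w * h * fps * 0.01 / q   # toy rate model
        complexity = w * h * fps / 1e6     # toy cost model
        quality = fps * w * h / (q * 1e4)  # toy quality proxy
        if bitrate <= bandwidth_kbps and complexity <= complexity_budget:
            if quality > best_quality:
                best, best_quality = (q, fps, (w, h)), quality
    return best

# Pick the best feasible point for a 384 kbps link and a fixed
# (made-up) complexity budget:
print(select(bandwidth_kbps=384, complexity_budget=2.0))
```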